Driven by the rapid growth of global industry, the number of containers
passing through terminal ports increases every day. It is therefore essential
to automate the identification of container codes and replace manual
identification, for more efficient logistics and a safer workplace. This paper
aims to design and evaluate the performance of such a system. Specifically,
an automated container code recognition (ACCR) system has been implemented.
It is a novel container tracking model based on image processing and machine
learning (ML) algorithms, intended for use in ports. The system has three
steps: character detection, character isolation, and character recognition.
The first step identifies an area containing characters drawn from the
10 digits and 26 capital letters. After the text area is detected, the second
step separates the characters. In the last step, each character is recognized
by a classification method: features are extracted with the histogram of
oriented gradients (HOG) algorithm, and support vector machines (SVMs) are
used for training and prediction. The trained ML model then classifies
characters and digits according to what it has learned. In general, digital
technologies for logistics and container management in ports will benefit
from the proposed algorithms.
1. INTRODUCTION

In recent years, the global supply chain has suffered from the impact of coronavirus disease of 2019 
(COVID-19) due to strict lockdowns in many countries around the world, leading to the closure of many 
companies and factories. Most companies must adapt to this pandemic situation by equipping themselves with 
transformative technologies that could help them maintain their production lines [1], [2]. Specifically, tracking 
and tracing (T&T) systems have been implemented, i.e., the usage of barcodes and radio frequency 
identification (RFID) tags [3]. In the supply chains of vital medical products, RFID tags have been utilized to
track and authenticate plasma, test kits, vaccines, and personal protective equipment (PPE). In [4], it was shown
that the ability to record and exchange information in real time is of the utmost importance for supply chains in
terms of collaboration and the ability to cope with and recover from disruptions. In the international logistics
system, containers have undeniably become one of the most important assets for freight transport. The tracking 
and management of containers, thus, are also essential since the containers are shipped globally and frequently 
switched between different shipping vehicles. Port terminals have traditionally recorded container codes
manually. This incurs high labor costs and leads to human errors due to fatigue from repetitive
work. In view of this, automatic container code recognition (ACCR) systems are deployed to automate the
recording process at ports. ACCR systems can be implemented relatively easily by installing management
software and camera(s) at the port terminals. The systems differ from one another in the computational
methods used to extract information from images [5]. A typical ACCR system is programmed with
two steps, i.e., code localization followed by recognition. Traditional localization methods extract information 
from text areas by analyzing stroke features, gray level, edges, contours, and histogram information obtained 
after processing grayscale images [6]. Additionally, other features such as filters [7] and masks [8] can be 
utilized for the same purpose. After the codes are localized, the recognition part follows. To fully and accurately 
recognize container codes, characters are first isolated so that individual characters can be recognized. From
previous studies of automatic license plate recognition (ALPR), it can be seen that even if the recognition
algorithm can handle different character orientations, sizes, and fonts, the results are still incorrect if the
segmentation is set improperly [9], [10]. In addition, the segmentation process is affected by blurriness, noise,
uneven illumination, and shadows. A typical ALPR system uses optical character recognition (OCR) techniques in
which characters are segmented from the license plate. On the other hand, OCR is also employed to recognize
characters and digits in situations where segmentation is difficult, i.e., multi-character reading. From another
perspective, convolutional neural networks (CNNs) show their strength in dealing with multi-character contexts 
without the need to employ segmentation, especially in unconstrained images [11]-[14]. Algorithms used for
ALPR are based on either template matching or machine learning (ML) [15], [16]. The ML methods are more
robust, as they use more stable features, e.g., image density [17] or direction features [18]. Among the most
common techniques used in research and practice are the support vector machine (SVM) [19] and the probabilistic
neural network (PNN) [20]. In addition, in the context of scene text detection, it is worth mentioning that
deep learning (DL) methods, e.g., convolutional recurrent neural networks (CRNNs), have been extensively
developed [21], [22].

The role of digitalization in the container shipping supply chain (CSSC) was discussed in depth
by Song [23]. In view of this, state-of-the-art technology such as DL has the potential to improve all the major
segments of the CSSC, i.e., vessel, freight, and container logistics. For the application of DL in container logistics,
Li and He [24] combined DL with computational logistics into a so-called container terminal-oriented
neural-physical fusion computation (CTO-NPFC), which is used to analyze the performance of the container
terminal handling system (CTHS) at ports. In addition, Zhang et al. [25] introduced a highly accurate
approach to localize and recognize the codes. Specifically, the authors deployed an adaptive score aggregation
(ASA) algorithm to remove noisy text regions. The boundaries of the code regions were then
identified using an average-to-maximum suppression range (AMSR) algorithm. The proposed approach
runs at 1.13 FPS with an accuracy of 93.33%. Liu et al. [26] further improved the localization accuracy
with a real-time ML system that predicts the texts and their boundaries and then fuses the results
to improve the segmentation accuracy. Their system achieved an F-measure of 96.5% at 70 FPS
on on-field datasets.

In short, the repeated manual identification process at terminal ports has hindered progress
toward a more efficient logistics system, because manual checking incurs extra labor
cost and is time-consuming. Besides, performing the same job every day is likely to cause human errors,
and direct contact with containers from all over the world could expose port workers to health risks, e.g.,
from the current COVID-19 pandemic. Hence, automating this process is a necessity. Based on the
reviewed literature and the above discussion, we present herein a container code recognition system, which
includes three modules, i.e., segmentation, isolation, and character prediction. The recognition model and the
insights of the techniques are introduced and explained in detail. The case study is conducted at a port in Ho Chi
Minh City, Vietnam, and the results are highly satisfactory. In addition to the introduction, section 2 discusses the
method that we employed. Section 3 shows the results of the proposed model compared to other similar models.
The study is concluded in section 4.


2. METHOD 

The container code recognition system in this study comprises three primary modules,
namely segmentation, isolation, and character prediction. First, areas containing text are detected in the given
grayscale images (segmentation). The lines of text are then separated and forwarded to the next step
(isolation). Finally, SVMs are deployed to classify the isolated characters. It should be noted that, for each
step, appropriate algorithms must be chosen so that the overall performance can be optimized. The flowchart
of the recognition system is shown in Figure 1. All the modules are discussed in detail in the following subsections.

Figure 1. The recognition system’s flowchart


2.1. Text area determination 

For the first module, a text region localization method is deployed to separate the text-line
areas from the background. The input images are usually poorly lit and contain reflections and defects,
which would significantly degrade the accuracy and efficiency of the segmentation process. Thus, the input
images are pre-processed with seven steps, presented in detail from section 2.1.1 to section 2.1.6. Figure 2
to Figure 8 show the image resulting from each step. Starting from the original image in Figure 2, Figure 8 is
produced as the final image, ready for the separation and recognition of the characters.


2.1.1. Grayscale 
First, a red, green, and blue (RGB) image of the back of the container is captured and
downscaled. The center area of the image is taken as the container area and converted to grayscale using (1).

$$L = 0.299R + 0.587G + 0.114B \tag{1}$$

where $R$, $G$, and $B$ denote the red, green, and blue color channels of the original image, respectively.
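
As an illustration of (1), the following minimal sketch converts a small RGB image, represented as a nested list of (R, G, B) tuples, to gray levels. The function name `to_grayscale` and the 2x2 test image are hypothetical and only serve the example; they are not part of the system described here.

```python
def to_grayscale(rgb_image):
    """Apply L = 0.299R + 0.587G + 0.114B to every (R, G, B) pixel."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


# Hypothetical 2x2 image: red, green, blue, and white pixels.
image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]
gray = to_grayscale(image)
print(gray)
```

Because the three weights sum to one, a white pixel keeps its full intensity, while green contributes most to the perceived brightness.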


2.1.2. Gaussian blur 

The grayscale image still contains noise stemming from different illumination directions, lighting
conditions, and the equipment used to capture the image. The image is therefore blurred to reduce noise
and detail using a two-dimensional Gaussian function of the x and y coordinates [27], given in (2).

$$G(x, y) = a \exp\left(-\left(\frac{(x - b_x)^2}{2c_x^2} + \frac{(y - b_y)^2}{2c_y^2}\right)\right) \tag{2}$$

where $a$ is the height of the curve's peak, $b_x$ and $b_y$ are the means (the position of the peak), and $c_x^2$ and $c_y^2$ are the variances of the x and y variables.
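
The blur in (2) can be sketched as follows. This is a minimal pure-Python illustration, assuming a peak centered on the kernel (b_x = b_y = 0), equal spread in both directions (c_x = c_y = sigma), and a peak height chosen so that the kernel sums to one; the kernel size, sigma, and border handling (edge replication) are assumptions, not settings reported in this study.

```python
import math


def gaussian_kernel(size, sigma):
    """Sample (2) on a size x size grid (size odd) and normalize to sum 1."""
    half = size // 2
    kernel = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
               for x in range(-half, half + 1)]
              for y in range(-half, half + 1)]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def blur(image, kernel):
    """Weighted average of each pixel's neighborhood; borders are replicated."""
    h, w = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    rr = min(max(r + dr, 0), h - 1)
                    cc = min(max(c + dc, 0), w - 1)
                    acc += image[rr][cc] * kernel[dr + half][dc + half]
            out[r][c] = acc
    return out


kernel = gaussian_kernel(5, 1.0)
noisy = [[0, 0, 0], [0, 90, 0], [0, 0, 0]]
smoothed = blur(noisy, kernel)  # the isolated spike is spread over its neighbors
```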


2.1.3. Morphological gradient 

The morphological gradient is the difference between the dilation and the erosion of an image. After
passing through a morphological gradient filter, the edges of the blurred image are highlighted. Herein, the
morphological gradient is computed with the kernel (structuring element) [28] given in (3).

$$b(x) = \begin{cases} 0, & \text{if } |x| < 1 \\ -\infty, & \text{otherwise} \end{cases} \tag{3}$$

Accordingly, the morphological gradient of the grayscale image $f$, $G(f)$, is given by (4).

$$G(f) = f \oplus b - f \ominus b \tag{4}$$

where $\oplus$ and $\ominus$ denote the dilation and the erosion, respectively, and $b$ is the symmetric structuring element with short support.
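
A minimal sketch of (4) is given below. With a flat structuring element, grayscale dilation and erosion reduce to the local maximum and minimum over the element's support; here a square window of a hypothetical `radius` is assumed, which is not necessarily the element used in this study.

```python
def _window(image, r, c, radius):
    """Values inside the square window around (r, c), clipped at the borders."""
    h, w = len(image), len(image[0])
    return [image[rr][cc]
            for rr in range(max(r - radius, 0), min(r + radius, h - 1) + 1)
            for cc in range(max(c - radius, 0), min(c + radius, w - 1) + 1)]


def dilate(image, radius=1):
    """Flat grayscale dilation: local maximum."""
    return [[max(_window(image, r, c, radius)) for c in range(len(image[0]))]
            for r in range(len(image))]


def erode(image, radius=1):
    """Flat grayscale erosion: local minimum."""
    return [[min(_window(image, r, c, radius)) for c in range(len(image[0]))]
            for r in range(len(image))]


def morphological_gradient(image, radius=1):
    """Dilation minus erosion, as in (4); large values mark edges."""
    d, e = dilate(image, radius), erode(image, radius)
    return [[dv - ev for dv, ev in zip(drow, erow)] for drow, erow in zip(d, e)]


step_edge = [[0, 0, 10, 10]] * 3
print(morphological_gradient(step_edge))
```

For the step edge above, only the two columns adjacent to the intensity jump receive a nonzero response.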


2.1.4. Otsu’s binarization 

After computing the morphological gradient and the gray-level distribution of the input image, we
obtain a gray-level histogram with two peaks. The first peak represents the regions with text, and the second
represents the background. Otsu’s method is then used for image binarization [29]. According to [30], the best
adaptive threshold value $k^*$ is the one that maximizes the between-class variance $d^2$ in (5).

$$d^2 = w_1 (m_1 - m_t)^2 + w_2 (m_2 - m_t)^2 \tag{5}$$

where $m_1$ and $m_2$ are the mean gray levels of segments 1 and 2, respectively, $m_t$ is the mean gray level of the
whole image, and the coefficients $w_1$ and $w_2$ are the corresponding frequencies of the two segments. Given that
$m_t = w_1 m_1 + w_2 m_2$ and $w_1 + w_2 = 1$, (5) can be rewritten as (6).

$$d^2 = w_1 w_2 (m_1 - m_2)^2 \tag{6}$$

The weight $w_j$ of segment $j \in \{1, 2\}$ is the total probability in (7).

$$w_j = \sum_{i \in C_j} P_i, \quad j = 1, 2 \tag{7}$$

where $C_j$ is the set of gray levels assigned to segment $j$ and $P_i$ denotes the relative frequency of occurrence
of gray level $i$ (out of 256 gray levels) in the text image. The relative frequencies of all gray levels therefore sum to one:

$$\sum_{i} P_i = 1 \tag{8}$$

Segment 1 contains all pixels whose gray level is less than or equal to the threshold $k$, and segment 2 contains
the remaining pixels. For a given threshold value $k$, if the gray level of a pixel is higher than $k$,
the pixel under consideration becomes black (value 1). On the other hand, if the gray level is equal to or
lower than $k$, the pixel is turned white (value 0). The average gray level $m_j$ of segment $j$ is given by (9).

$$m_j = \sum_{i \in C_j} \frac{i P_i}{w_j} \tag{9}$$

Consequently, the best threshold value $k^*$ is the value of $k$ at which $d^2$ reaches its maximum.
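
The threshold search can be sketched as follows, assuming integer gray levels from 0 to 255 and an exhaustive scan over all candidate thresholds. The sketch maximizes the two-class form of the criterion, w1 w2 (m1 - m2)^2, and binarizes with the convention stated above (1 above the threshold, 0 otherwise). The function names and the toy pixel list are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold k that maximizes w1 * w2 * (m1 - m2)^2."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = float(len(pixels))
    probs = [h / total for h in hist]
    total_mean = sum(i * p for i, p in enumerate(probs))

    best_k, best_d2 = 0, -1.0
    w1, sum1 = 0.0, 0.0           # weight and unnormalized mean of segment 1
    for k in range(levels - 1):
        w1 += probs[k]
        sum1 += k * probs[k]
        w2 = 1.0 - w1
        if w1 <= 0.0 or w2 <= 0.0:
            continue              # one segment is empty, criterion undefined
        m1 = sum1 / w1
        m2 = (total_mean - sum1) / w2
        d2 = w1 * w2 * (m1 - m2) ** 2
        if d2 > best_d2:
            best_k, best_d2 = k, d2
    return best_k


def binarize(pixels, k):
    """Gray level above k becomes 1, otherwise 0."""
    return [1 if p > k else 0 for p in pixels]


pixels = [10] * 50 + [200] * 50   # toy bimodal distribution
k_star = otsu_threshold(pixels)
binary = binarize(pixels, k_star)
```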


2.1.5. Morphological close 

After Otsu’s algorithm is applied, a morphological operation is applied again in this morphological close step.
Morphological closing fills small gaps, removes noise, and smooths the contours of objects in the
image. After this process, broken strokes in the image are repaired, giving better image quality. It should be
noted that for real photo inputs, more than one pre-processing technique must be used to achieve sufficient
image quality for the image-processing task.
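
Closing is a dilation followed by an erosion with the same structuring element. The sketch below assumes a binary image stored as nested lists of 0 and 1 and a square structuring element of a hypothetical `radius`; it shows how a one-pixel gap between two strokes is filled while the background is left unchanged.

```python
def _neighborhood(image, r, c, radius):
    """Values in the square window around (r, c), clipped at the borders."""
    h, w = len(image), len(image[0])
    return [image[rr][cc]
            for rr in range(max(r - radius, 0), min(r + radius, h - 1) + 1)
            for cc in range(max(c - radius, 0), min(c + radius, w - 1) + 1)]


def dilate(image, radius=1):
    return [[max(_neighborhood(image, r, c, radius))
             for c in range(len(image[0]))] for r in range(len(image))]


def erode(image, radius=1):
    return [[min(_neighborhood(image, r, c, radius))
             for c in range(len(image[0]))] for r in range(len(image))]


def close(image, radius=1):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(image, radius), radius)


# Two strokes separated by a one-pixel gap at column 5.
broken = [[0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0]] * 5
closed = close(broken)
print(closed[0])
```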


2.1.6. Finding bounding boxes 

Bounding boxes are then specified for the features of interest. Nevertheless, as shown in
Figure 6, boxes containing no container code can also be created. These irrelevant boxes are filtered out by
counting the nonzero pixels in each box, assuming that a box that isolates a text area must be at least
50% filled with text pixels.
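
The 50% rule can be sketched as a fill-ratio filter. In this illustration, boxes are hypothetical (x, y, width, height) tuples over a binary image stored as nested lists; how the candidate boxes are produced is outside the scope of the sketch.

```python
def fill_ratio(binary, box):
    """Fraction of nonzero pixels inside box = (x, y, width, height)."""
    x, y, w, h = box
    nonzero = sum(1 for r in range(y, y + h) for c in range(x, x + w)
                  if binary[r][c] != 0)
    return nonzero / float(w * h)


def filter_boxes(binary, boxes, min_fill=0.5):
    """Keep only the boxes that are at least min_fill filled."""
    return [b for b in boxes if fill_ratio(binary, b) >= min_fill]


binary = [[1, 1, 0, 0]] * 4
candidates = [(0, 0, 2, 4), (1, 0, 2, 4), (2, 0, 2, 4)]
kept = filter_boxes(binary, candidates)   # the empty right-hand box is dropped
```
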
Figure 4. Morphological gradient
Figure 5. Otsu’s binarization
Figure 8. Final results


2.2. Separation of characters in bounding boxes with texts 

To ensure maximum accuracy in character detection, we continuously assess the positioning of each 
character. By using projection in both directions (vertical and horizontal), we can make sure that each 
individual character is captured with maximum precision. This helps to increase the accuracy of the character 
recognition process and makes it easier to identify the characters quickly. 


2.2.1. Character division in text-line areas 

Owing to the fact that the characters have rigid edges and are not subjected to poor lighting 
conditions, we can divide them into a certain number of regions. Consequently, the binary edge image can be 
generated with the most optimal adaptive threshold value specifically calculated for each image. 
By evaluating the vertical projection histogram, we can obtain the width and height of the text area. Based on 
this, we can eliminate the non-character edges if the edges do not meet the standard measurement of the font 
characters. 


2.2.2. Projection method 

In this study, the projection method according to [31] is utilized after the segmentation and character
splitting have been performed. Using the histogram obtained from the vertical projection, we can
obtain the horizontal limits of a particular character. If two adjacent zones are close to one
another, the two are merged. In a similar manner, the bottom and top limits of a character can be
determined with the help of a horizontal projection. As a result, all individual characters can be detected.
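
The projection step can be sketched as follows, assuming a binary text-line image stored as nested lists. The vertical projection is the per-column sum; runs of nonzero columns give the horizontal limits of the characters, and zones separated by at most a hypothetical `max_gap` of empty columns are merged. The value of `max_gap` is an assumption, since the merging distance is not stated here.

```python
def vertical_projection(binary):
    """Number of foreground pixels in each column."""
    return [sum(col) for col in zip(*binary)]


def horizontal_projection(binary):
    """Number of foreground pixels in each row."""
    return [sum(row) for row in binary]


def find_zones(profile, max_gap=0):
    """Return (start, end) index pairs of nonzero runs, merging close runs."""
    zones, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            zones.append((start, i - 1))
            start = None
    if start is not None:
        zones.append((start, len(profile) - 1))

    merged = []
    for zone in zones:
        if merged and zone[0] - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], zone[1])   # gap small enough: merge
        else:
            merged.append(zone)
    return merged


line = [[1, 1, 0, 0, 1, 0, 1],
        [1, 0, 0, 0, 1, 0, 1]]
columns = find_zones(vertical_projection(line), max_gap=1)
rows = find_zones(horizontal_projection(line))
```

Applying `find_zones` to the horizontal projection of each character zone yields its top and bottom limits in the same way.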


2.3. Character recognition 
2.3.1. Feature extraction 
The histogram of oriented gradients (HOG) is utilized to extract the desired features from a given
image. Building a HOG vector involves four consecutive steps: (i) gradient calculation;
(ii) feature vector calculation on individual cells; (iii) block normalization; and (iv) HOG vector calculation.
The image gradients are calculated by convolving the image in the horizontal and vertical
directions. The derivatives in the Ox and Oy directions are determined with the kernels in (10).

$$D_x = \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}, \quad D_y = \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}^{T} \tag{10}$$

Let the input image be $I$; its derivatives in the two directions, $I_x$ and $I_y$, are calculated as in (11).

$$I_x = I * D_x, \quad I_y = I * D_y \tag{11}$$

Finally, the magnitude $G$ and direction $\theta$ of the gradient are computed as in (12).

$$G = \sqrt{I_x^2 + I_y^2}, \quad \theta = \arctan\frac{I_y}{I_x} \tag{12}$$
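
A minimal sketch of (10) to (12) follows. It applies the [-1 0 1] kernels as differences between the right and left (or lower and upper) neighbors, replicates the border pixels, and uses `atan2` with the result folded into [0, 180) degrees to avoid a division by zero when the horizontal derivative vanishes. These implementation choices are assumptions made for the example.

```python
import math


def gradients(image):
    """Return (magnitude, direction in degrees) maps for a grayscale image."""
    h, w = len(image), len(image[0])
    magnitude = [[0.0] * w for _ in range(h)]
    direction = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            ix = image[r][min(c + 1, w - 1)] - image[r][max(c - 1, 0)]
            iy = image[min(r + 1, h - 1)][c] - image[max(r - 1, 0)][c]
            magnitude[r][c] = math.hypot(ix, iy)
            direction[r][c] = math.degrees(math.atan2(iy, ix)) % 180.0
    return magnitude, direction


horizontal_ramp = [[0, 1, 2]] * 3
mag, ang = gradients(horizontal_ramp)   # center pixel: magnitude 2, direction 0
```
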
Secondly, after the gradient calculation, we can extract the feature vector from the individual cells.
The image under consideration is divided into blocks, each containing a predetermined number of cells, and
each cell is a collection of pixels. For a 128x128-pixel image, assuming that each cell is
4x4 pixels in size, the image is composed of 32x32 cells. Assuming further that each block consists of
4x4 cells, the image is composed of 8x8 blocks. In other words, one block contains 16 cells, or 256 pixels.

Figure 9 illustrates a block and how the feature vectors extracted from it are processed. In particular,
each cell (red-outlined square) contains 16 pixels (green-outlined squares). For each pixel, a gradient vector with
magnitude G and direction θ is calculated. The vectors calculated from all the pixels are then sorted into a 9-bin
histogram (HOG), as illustrated by the bar chart in Figure 9. After this binning,
we obtain 9 values representing the gradient vectors in each cell. This process is repeated for the rest of
the cells in the block under consideration. After processing the 16 cells in the block, we concatenate the
9 values of each cell and obtain a feature vector of size 16x9 = 144, i.e., 144x1, per block.
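
The voting step can be sketched as follows. The sketch assumes unsigned directions in [0, 180) degrees, 9 bins of 20 degrees each, and hard assignment of each pixel's magnitude to a single bin; interpolating votes between neighboring bins is a common refinement that is not shown here, and it is not stated which variant is used in this study.

```python
def cell_histogram(magnitudes, directions, bins=9):
    """Accumulate gradient magnitudes into orientation bins over [0, 180)."""
    width = 180.0 / bins
    hist = [0.0] * bins
    for m, theta in zip(magnitudes, directions):
        hist[int((theta % 180.0) // width) % bins] += m
    return hist


def block_vector(cell_histograms):
    """Concatenate the histograms of all cells in a block."""
    return [value for hist in cell_histograms for value in hist]


# Three hypothetical pixels with directions 0, 25, and 179 degrees.
hist = cell_histogram([1.0, 2.0, 3.0], [0.0, 25.0, 179.0])
block = block_vector([hist] * 16)   # 16 cells x 9 bins = 144 values
```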


Figure 9. Example of 1 block containing 4x4 (16) cells, which are classified using HOG

As the third step, normalization is carried out on the blocks for better recognition performance.
Specifically, based on the local histograms of the blocks, the threshold values for the intensity can be
computed and used for cell normalization in the block. This results in a characteristic vector that is invariant
to the lighting condition. Normalization can be performed using (13).

$$\text{L2-norm: } f = \frac{v}{\sqrt{\|v\|_2^2 + c^2}}, \quad \text{L1-norm: } f = \frac{v}{\|v\|_1 + c} \tag{13}$$

where $v$ is the vector containing the histograms of the block before normalization, $f$ is the normalized
vector, $\|v\|_k$ is the $k$-norm of $v$ ($k = 1, 2$), and $c$ is a small constant.

Figure 10. Different characters and digits after HOG application
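
The two normalization schemes in (13) can be sketched as below. The default value of the constant `c` is an assumption chosen only to avoid a division by zero for empty blocks.

```python
import math


def l2_normalize(v, c=1e-6):
    """f = v / sqrt(||v||_2^2 + c^2)."""
    norm = math.sqrt(sum(x * x for x in v) + c * c)
    return [x / norm for x in v]


def l1_normalize(v, c=1e-6):
    """f = v / (||v||_1 + c)."""
    norm = sum(abs(x) for x in v) + c
    return [x / norm for x in v]


dim = l2_normalize([3.0, 4.0])
bright = l2_normalize([30.0, 40.0])   # same block under stronger illumination
```

Scaling all gradient magnitudes of a block by the same factor leaves the normalized vector nearly unchanged, which is the source of the robustness to lighting.
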
In this study, HOG is applied to 28x28-pixel binary images of characters and digits in
different fonts. Because the images are small, we assume that a block has the size of a cell for easier processing.
Thus, with each cell being 4x4 pixels in size, each block is also 4x4 pixels, and each image contains
7x7 blocks. After voting in the HOG histogram, we obtain one feature vector for each block. The feature vector
of each block is then normalized with the L2-norm method, which yields one new feature vector per
block. By combining the 7x7 = 49 feature vectors of the 49 blocks, we obtain one feature
vector (column vector) for the entire image with the size of 49x9 = 441, i.e., 441x1. Each character or digit,
in each of its formats, thus has one distinct feature vector.
The collection of feature vectors extracted from the characters and digits is used as the input for the SVM
algorithm in the next step. Examples of the characters and digits after applying HOG can be seen in Figure 10.
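
The descriptor sizes quoted in this section follow from simple arithmetic, sketched below under the assumption of square images and non-overlapping blocks, as described in the text. The function name `hog_length` is hypothetical.

```python
def hog_length(image_size, cell_size, cells_per_block_side, bins=9):
    """Length of the HOG vector for a square image with non-overlapping blocks."""
    cells_per_side = image_size // cell_size
    blocks_per_side = cells_per_side // cells_per_block_side
    values_per_block = cells_per_block_side ** 2 * bins
    return blocks_per_side ** 2 * values_per_block


print(hog_length(28, 4, 1))    # 7x7 blocks of one cell each, 9 bins per cell
print(hog_length(128, 4, 4))   # 8x8 blocks of 16 cells each (144 values per block)
```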


2.3.2. Training and classifying 

Datasets are either almost linearly separable or nonlinearly separable. For the latter case,
kernel methods seek an appropriate transformation that maps the nonlinearly separable data to a new
space in which the data become linearly separable. This is beneficial for soft-margin SVM classification.
Let $\Phi(\cdot)$ be the function that performs the transformation to this new space, so that a data point $x$ in the old
space becomes $\Phi(x)$. $\Phi(x)$ need not be calculated explicitly for each and every data point.
Instead, we only need to calculate $\Phi(x)^{T}\Phi(z)$ for two data points $x$ and $z$ using the kernel trick.
For the linear kernel, this is given by (14).

$$k(x, z) = x^{T} z \tag{14}$$


It should be taken into account that we create 36 classes in total for training and classification.
For the application described herein, the linear SVM is the most suitable. In general, SVMs are algorithms
that construct one or several hyperplanes in a high- or infinite-dimensional space, making them
highly applicable to classification and regression tasks. An SVM classifier is a binary classifier that operates in a
supervised manner [32]. The Char74k dataset was used here for model training because it offers a variety of
characters formatted in different ways, which diversifies the learning of the character features.
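
Since an SVM is a binary classifier, a multi-class strategy is needed to cover the 36 classes. The sketch below illustrates prediction with a one-vs-rest scheme and the linear kernel in (14); the one-vs-rest choice, the function names, and the toy weights are assumptions made for illustration, as the strategy used in this study is not stated. Training of the weights is not shown.

```python
def linear_kernel(x, z):
    """k(x, z) = x^T z."""
    return sum(a * b for a, b in zip(x, z))


def predict(x, weights, biases, labels):
    """One-vs-rest: pick the class whose hyperplane gives the largest score."""
    scores = [linear_kernel(w, x) + b for w, b in zip(weights, biases)]
    return labels[scores.index(max(scores))]


# Toy 2-D example with two hypothetical classes; a real model would hold
# 36 weight vectors of length 441, one per character class.
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
labels = ["A", "B"]
print(predict([2.0, 1.0], weights, biases, labels))
```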


3. RESULTS AND DISCUSSION 
3.1. Shipping container code and its meaning 

In accordance with the ISO 6346 standard, a standardized container code consists of 11 characters, of which
the first three must be capital letters indicating the container’s owner. The fourth character classifies the
equipment type and must also be a capital letter. The next six digits are the serial number of the
container, and they can be validated using the last digit, the check digit. Thus, the text-line area must be cropped
into four areas containing, respectively, 4, 6, 7, or 1 character(s) with regard to the localization prediction. The obtained
results are eventually combined into a sequence of 11 characters. The predicted sequences are then ranked according
to the localization confidence for further evaluation.
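
The check digit offers a way to reject misread codes. The sketch below follows the ISO 6346 check-digit procedure: letters map to the values 10 to 38 with multiples of 11 skipped, each of the first ten characters is weighted by a power of two according to its position, and the sum is reduced modulo 11 and then modulo 10. Whether the system described here performs this validation is not stated, so the helper names are hypothetical and the code is only an illustration.

```python
import string


def letter_values():
    """A=10, B=12, ..., Z=38, skipping the multiples of 11."""
    values, v = {}, 10
    for ch in string.ascii_uppercase:
        if v % 11 == 0:
            v += 1
        values[ch] = v
        v += 1
    return values


LETTERS = letter_values()


def check_digit(first_ten):
    """Check digit for the owner code, category letter, and serial number."""
    total = 0
    for position, ch in enumerate(first_ten):
        value = LETTERS[ch] if ch.isalpha() else int(ch)
        total += value * (2 ** position)
    return total % 11 % 10


def is_valid(code):
    """True if an 11-character code has the right layout and check digit."""
    code = code.upper()
    if len(code) != 11 or not code[:4].isalpha() or not code[4:].isdigit():
        return False
    return check_digit(code[:10]) == int(code[10])


print(is_valid("CSQU3054383"))
```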


3.2. Dataset 

The dataset herein is generated from various character fonts, rendered as 28x28-pixel images that are binarized to
facilitate training, testing, and prediction. We employ 1024 images per character and modify them
with image augmentation techniques (rotation, shear, reflection, scaling, and blurring) to generate numerous
new images that diversify the training dataset. The augmentation preserves the characteristics of the
characters and, at the same time, mitigates the overfitting problem. The characters are also modified
by changing their format (bold, italic, and bold italic), as shown in Figure 11.

To assess the system performance, we use more than 400 pictures, up to
800x600 pixels in size, taken of real containers at the port. The images were captured under daylight
conditions. Moreover, the images differ in contrast and brightness, and the containers have different
sizes and colors and are located differently in the frame.


Figure 11. The modification of the characters 
We need to design some criteria for the localization and recognition tasks. Container codes can be
captured on the back and the top of the containers. However, the codes on the top are poorly
lit and less compact than the ones on the back. Thus, we train the code localization model to ignore the
codes on the top of the shipping container. In particular, we investigate the precision, recall, F1-score, and average
precision (AP) to assess how the code localization model performs. For object detection, AP can be
computed as the area under the precision-recall curve (AUC). Accordingly, we calculate the interpolated
precision $P_{interp}$ as in (15).

$$P_{interp}(r) = \max_{\tilde{r} \ge r} p(\tilde{r}) \tag{15}$$

where $p(\tilde{r})$ denotes the precision measured at the recall $\tilde{r}$.

Then, the 11-point interpolated AP according to [29] is deployed. The system is considered to 
perform correctly if and only if the 11-digit code sequence it predicts coincides with the real codes on the 
field. A misprediction of even 1 digit is unacceptable, since the unique identification of a shipping container 
can be obtained using the exact 11 digits. 


In addition, the metrics are defined as

$$\text{Recall} = \frac{TP}{TP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} P_{interp}(r),$$

$$\text{F1-score} = \frac{2\,(\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}, \quad \text{recognition accuracy (\%)} = \frac{N_{correct}}{N_{total}},$$

in which TP, FP, and FN abbreviate true positive, false positive, and false negative, respectively. $N_{correct}$
denotes the number of correct predictions out of the total number of predictions, $N_{total}$. The accuracy of the three
consecutive modules of the recognition model is listed in Table 1.
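
These metrics can be sketched as follows. The precision-recall points passed to `ap_11_point` and the counts in the usage lines are hypothetical and are not results from this study; recall levels with no measured point at or beyond them contribute an interpolated precision of zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from detection counts."""
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def ap_11_point(recalls, precisions):
    """Mean of the interpolated precision at recall 0, 0.1, ..., 1."""
    total = 0.0
    for i in range(11):
        r = i / 10.0
        candidates = [p for rc, p in zip(recalls, precisions) if rc >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11.0


p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
ap = ap_11_point([0.5, 1.0], [1.0, 0.5])
```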

It can be observed in Table 1 that the accuracy of the three modules in the proposed system is relatively
high. Furthermore, to evaluate the model performance, we created a score map using the fixed threshold method 
presented in Table 2. In the score map, the confidences (scores) above some predetermined (fixed) threshold are 
kept, and the rest is removed. It should be noted that the threshold methods would affect the location prediction; 
thus, we need to first merge the bounding boxes which are overlapping. Then, the evaluation can be carried out. 
Because of this, we need to evaluate the precision, recall, and f1-score to facilitate the comparison of the merged 
bounding box with the ground truth bounding boxes. 

As shown in Table 2, the threshold values for precision and recall are respectively 0.9 and 0.3. 
The regions with a confidence level above 0.8 are text while the regions below 0.2 are background or noise. 
Subsequently, the adaptive threshold value is set between 0.2 and 0.8. The proposed adaptive threshold can 
maintain the text proposals with a confidence level in the range of 0.2 to 0.8. As we merge the overlapped 
proposals, we can obtain a higher confidence, thus, a higher AP value, which is highly beneficial for the code 
recognition model. 


Table 1. Execution assessment of the proposed strategy 


Module                      Accuracy (%)
Determine text-line areas   89.58
Isolation                   90.24
Characters recognition      90.45
Overall                     89.35


Table 2. Comparison of the threshold methods 


Threshold   Recall   Precision   AP       F1-score
0.1         0.9351   0.9421      0.8929   0.9405
0.2         0.9379   0.9443      0.8933   0.9407
0.3         0.9415   0.9476      0.8929   0.9420
0.4         0.9305   0.9466      0.8933   0.9413
0.5         0.9469   0.9435      0.8905   0.9420
0.6         0.9296   0.9447      0.8915   0.9417
0.7         0.9396   0.9416      0.8925   0.9419
0.8         0.9359   0.9447      0.8913   0.9437
0.9         0.9301   0.9481      0.8925   0.9442


4. CONCLUSION 

In a context where logistics systems have suffered severe disruption due to the COVID-19 pandemic, the
application of an automatic container code recognition system is highly beneficial. First, the proposed framework
can deal with a variety of container types and colors, as well as different types of image defects caused by different
lighting conditions at the port. As a result, human contact at ports, and with it the risk of COVID-19 contagion,
as well as labor costs and human errors, can be reduced. Secondly, the model learns from training data with varied
character fonts, so it can deliver accurate results (up to 90%) when predicting the 11-digit sequences in real time.
The results of this study connect research and practice: the approach is relatively simple, yet
it yields highly accurate results. Having been evaluated on on-field datasets from ports, it shows the potential to be
widely applied at many other ports that would like to automate their code identification process. From the
researcher’s point of view, the accuracy of the model can be improved by applying state-of-the-art DL techniques
to the three modules of the proposed model. Future research on this topic
can focus on capturing images of containers while they are moving, under different lighting conditions
(rainy days, nighttime), from different cameras, and at varied angles. The size of the datasets can also be
increased so that the model has more data to learn from to improve its prediction accuracy. By and large,
the automation of code recognition at ports is just a small part of smart logistics, whose backbone
technologies can be applied to the development of smart cities in the near future.